Abstract

Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large 
number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor 
local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation
error) by using the dropout technique, a widely used heuristic in this domain. In particular, we show that by 
randomly dropping a few nodes of a one-hidden layer neural network, the training objective function, up to a 
certain approximation error, decreases by a multiplicative factor. 

On the flip side, we show that for training convex empirical risk minimizers (ERM), dropout in fact acts as 
a “stabilizer” or regularizer. That is, a simple dropout based GD method for convex ERMs is stable in the face 
of arbitrary changes to any one of the training points. Using the above assertion, we show that dropout provides 
fast rates for generalization error in learning (convex) generalized linear models (GLM). Moreover, using the 
above mentioned stability properties of dropout, we design dropout based differentially private algorithms for 
solving ERMs. The learned GLM thus preserves the privacy of each individual training point while providing
accurate predictions for new test points. Finally, we empirically validate our stability assertions for dropout in 
the context of convex ERMs and show that surprisingly, dropout significantly outperforms (in terms of prediction 
accuracy) the L2 regularization based methods for several benchmark datasets. 



1 Introduction 


Recently, deep belief networks (DBNs) have been used to design state-of-the-art systems in several important learning
applications. An important reason for the success of DBNs is that they can model complex prediction functions
using a large number of parameters linked through non-linear gating functions. However, this also makes training
such models an extremely challenging task. Since there are potentially a large number of local minima in the space
of parameters, any standard gradient descent style method is prone to getting stuck in a local minimum which might
be arbitrarily far from the global optimum.


A popular heuristic to avoid such local minima is dropout, which perturbs the objective function randomly by dropping
out several nodes of the DBN. Recently, there has been some work to understand this heuristic in certain
limited convex settings Baldi and Sadowski [2014], Wager et al. [2013]. However, in general the heuristic is not
well understood, especially in the context of DBNs.


In this work, we first seek to understand why and under what conditions dropout helps in training DBNs. To this end, 
we show that for fairly general one-hidden layer neural networks, dropout indeed helps avoid local minima/stationary 
points. We prove that the following holds with at least a constant probability: dropout decreases the objective 
function value by a multiplicative factor, as long as the objective function value is not close to the optimal value 
(see Theorem 1). To the best of our knowledge, ours is the first such result that explains the performance of dropout for
training neural networks. 


Recently in a seminal work, Andoni et al. [2014] showed rigorously that a gradient descent based method for neural
networks can be used to learn low-degree polynomials. However, their method analyzes a complex perturbation to
gradient descent and does not apply to dropout. Moreover, our results apply to a significantly more general problem
setting than the ones considered in Andoni et al. [2014]; see Section 2.1 for more details.


Excess Risk bounds for Dropout. We also study the dropout heuristic in the relatively easier setting of
convex empirical risk minimization (ERM), where gradient descent methods are known to converge to the global
optimum. In contrast to the above mentioned “instability” result, for the convex ERM setting our result indicates that
the dropout heuristic leads to “stability” of the optimum. This hints at a dichotomy: dropout makes the global
optimum stable while de-stabilizing local optima.

In particular, we study the excess error incurred by the dropout method when applied to the convex ERM problem.
We show that, in expectation, dropout solves a problem similar to weighted L2-regularized ERM and exhibits fast
excess risk rates (see Theorem 3). In comparison to recent works that analyze dropout for ERM style problems Baldi
and Sadowski [2014], Wager et al. [2014], we study the general problem of convex ERM in generalized linear models
(GLMs) and provide precise generalization error bounds for the same. See Section 3.1 for more details.


Private learning using dropout. Privacy is a looming concern for several large scale machine learning applications
that have access to potentially sensitive data (e.g., medical health records) Dwork [2006]. Differential privacy
Dwork et al. [2006b] is a cryptographically strong notion of statistical data privacy. It has been extremely effective
in protecting privacy in learning applications Chaudhuri et al. [2011], Duchi et al. [2013], Song et al. [2013], Jain
and Thakurta [2014].


As mentioned above, for convex ERMs, dropout can be shown to be “stable” w.r.t. changing one or few entries in 
the training data. Using this insight, we design a dropout based differentially private algorithm for convex ERMs (in
GLM). Our algorithm requires that, in expectation over the randomness of dropout, the minimum eigenvalue of the 
Hessian of the given convex function should be lower bounded. This is in stark contrast to the existing differentially 
private learning algorithms. Most of these methods either need a strongly convex regularization or assume that the 
given ERM itself is strongly convex. 


Experimental evaluation of dropout. Finally, we empirically validate our stability and “regularization” assertions
for dropout in the convex ERM setting. In particular, we focus on the stability of dropout w.r.t. removal of training
data, i.e., leave-one-out (LOO) stability. We study the random and adversarial removal of data samples. Interestingly, recent works
by Szegedy et al. [2013] and Maaten et al. [2013] provide a complementary set of experiments: while we study
dropout for adversarial removal of the training data, Szegedy et al. [2013] study adversarial perturbation of test
inputs and Maaten et al. [2013] consider corrupted features. Our experiments indicate that dropout engenders more
stability in accuracy than L2 regularization (with appropriate cross-validation to tune the regularization parameter).
Moreover, perhaps surprisingly, dropout yields a more accurate classifier than the popular L2 regularization for
several datasets. For example, for the Atheist dataset from UCI repository, dropout based logistic regression is
almost 3% more accurate than the L2 regularized logistic regression.


Paper Organization: We present our analysis of dropout for training neural networks in Section 2. Then, Section 3
presents excess risk bounds for dropout when applied to the convex ERM problem. In Section 4, we show that
dropout applied to convex ERMs leads to stable solutions that can be used to guarantee differential privacy for the
algorithm. Finally, we present our empirical results in Section 5.


2 Dropout algorithm for neural networks 


In this section, we provide rigorous guarantees for training a certain class of neural networks (which are in particular
non-convex) using the dropout heuristic. In particular, we show that dropout ensures with a constant probability that
gradient descent does not get stuck in a “local optimum”. In fact, under certain assumptions (stated in Theorem 1),
one can show that the function estimation error actually reduces by a multiplicative factor due to dropout. Andoni
et al. [2014] also study the robustness properties of the local optima encountered by the gradient descent procedure
while training neural networks. However, their proof applies only for complex perturbation of gradient descent and
only for approximating low-degree polynomials.


Problem Setting. We first describe the exact problem setting that we study. Let the space of input feature vectors be
$\mathcal{X} \subseteq \mathbb{R}^p$. Let $\mathcal{D}$ be a fixed distribution defined on $\mathcal{X}$. For a fixed function $f : \mathcal{X} \to \mathbb{R}$, the goal is to approximate
the function $f$ with a neural network (which we will define shortly). For a given estimated function $g : \mathcal{X} \to \mathbb{R}$, the
error is measured by $\|g - f\|_{\mathcal{D}}^2 = \mathbb{E}_{x \sim \mathcal{D}}\left[|g(x) - f(x)|^2\right]$. We also define the inner product w.r.t. the distribution $\mathcal{D}$ as
$\langle g, f \rangle_{\mathcal{D}} = \mathbb{E}_{x \sim \mathcal{D}}\left[g(x) f(x)\right]$. We now define the architecture of the neural network that is used to approximate $f$.


Neural network architecture. We consider a one-hidden layer neural network architecture with $m$ nodes in the
hidden layer. Let the underlying function $f$ be given by $f(x) = \sum_{i \in [m]} \alpha_i \phi_i(\langle \theta_i^*, x \rangle)$, where $\phi_i : \mathbb{R}^p \to \mathbb{R}$ is the link
function for each hidden layer node $i$. For simplicity, we assume that the coefficients $\alpha_i > 0$ are fixed for all $1 \le i \le m$.
The goal is to learn parameters $\theta_i^* \in \mathbb{R}^p$ for each node. Also, let t a r = d. The training data given for the learning
task is $\{(x, f(x))\}_{x \sim \mathcal{D}}$. Note that Andoni et al. [2014] also study the same architecture, but their link functions $\phi_i$
are assumed to be low-degree polynomials.


Dropout heuristic. We now describe the dropout algorithm for this problem. At the $t$-th step (for any $t \ge 1$), sample
a data point $(x, f(x)) \sim \mathcal{D}$ and perform gradient descent with learning rate $\eta$: $\theta_i^t = \theta_i^{t-1} - \eta \nabla_{\theta_i^{t-1}} \ell(f, g, x)$,
where $\nabla_{\theta_i^{t-1}}$ is the gradient of the error in approximation of $f$. That is,

$$\ell(f, g, \theta^t; x) = (f(x) - g(x))^2,$$

where $g(x) = \sum_{i \in [m]} \alpha_i \phi_i(\langle \theta_i^t, x \rangle)$ is the $t$-th step approximation to $f$.

Now, if the procedure is stuck in a local minimum, then we use dropout perturbation to push it out of the local
minimum. That is, select a vector $b \in \{0,1\}^m$ where each $b_i \sim_{\text{unif}} \{0,1\}$. Now, for the current estimation $g$ (at time
step $t$) we obtain a new polynomial $\hat{g}$ as:

$$\hat{g}(x) = 2 \sum_{i \in [m]} \alpha_i b_i g_i(x),$$

where $g_i(x) = \phi_i(\langle \theta_i^t, x \rangle)$. We now perform the gradient descent procedure using this perturbed $\hat{g}$ instead of the true
$t$-th step iterate $g$.
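
The dropout perturbation described above can be illustrated with a small sketch. It builds the perturbed estimate as twice the sum of the surviving weighted node outputs and checks, by enumerating every mask, that its average over the dropout randomness equals the unperturbed estimate. The tanh link, the toy weights, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from itertools import product

def node_outputs(thetas, x, link=math.tanh):
    """g_i(x) = link(<theta_i, x>) for every hidden node i (tanh is an assumed link)."""
    return [link(sum(t * xi for t, xi in zip(theta, x))) for theta in thetas]

def g(alphas, thetas, x):
    """Current estimate g(x) = sum_i alpha_i * g_i(x)."""
    return sum(a * gi for a, gi in zip(alphas, node_outputs(thetas, x)))

def g_dropout(alphas, thetas, x, mask):
    """Perturbed estimate: 2 * sum_i alpha_i * b_i * g_i(x) for a 0/1 mask b."""
    outs = node_outputs(thetas, x)
    return 2.0 * sum(a * b * gi for a, b, gi in zip(alphas, mask, outs))

def mean_over_all_masks(alphas, thetas, x):
    """Exact average of the perturbed estimate over b uniform on {0,1}^m."""
    m = len(alphas)
    total = sum(g_dropout(alphas, thetas, x, mask)
                for mask in product((0, 1), repeat=m))
    return total / 2 ** m

# Toy network with m = 3 hidden nodes and p = 2 inputs (arbitrary values).
alphas = [0.5, 1.0, 0.25]
thetas = [[0.3, -0.2], [1.0, 0.5], [-0.7, 0.1]]
x = [0.4, -1.2]
mask = [random.randint(0, 1) for _ in alphas]
print("g(x)            :", g(alphas, thetas, x))
print("one dropout draw:", g_dropout(alphas, thetas, x, mask))
print("mean over masks :", mean_over_all_masks(alphas, thetas, x))
```

The factor 2 is what makes the perturbation unbiased, since each mask bit has mean 1/2.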

We now analyze the effectiveness of the above dropout heuristic for function approximation. We would like to
stress that the objective in this section is to demonstrate instability of local minima in function approximation w.r.t.
dropout perturbation. This entails that if the gradient descent procedure is stuck in a local minimum/stationary point,
then the dropout heuristic helps get out of the local minimum and in fact reduces the estimation error significantly. However,
the exposition here does not guarantee that the dropout algorithm reaches the global minimum. It just ensures that
using dropout one can get out of the local minimum.

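Theorem 1. Let $f = \sum_{i \in [m]} \alpha_i \phi_i(\langle \theta_i^*, x \rangle) : \mathcal{X} \to \mathbb{R}$ be the true polynomial for $\mathcal{X} \subseteq \mathbb{R}^p$, where $\phi_i$ represents the $i$-th
node’s link function in the neural network. Let $(\alpha_1, \dots, \alpha_m)$ represent the weights on the output layer of the neural
network, with $\alpha_{\min} = \min_{i \in [m]} |\alpha_i|$. Let $g = \sum_{i \in [m]} \alpha_i \phi_i(\langle \theta_i, x \rangle) = \sum_{i \in [m]} \alpha_i g_i : \mathcal{X} \to \mathbb{R}$ be the current estimate of $f$.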

Let $\mathcal{D}$ be a fixed distribution on $\mathcal{X}$ from which the training examples $(x, f(x))$ are drawn. If $\|g\|_{\mathcal{D}} \ge \|f\|_{\mathcal{D}}$ and

Mil max ||3i|||,Vm 

\\g — f Hi, > - —f -’ th en with probability at least 1 /8 over the dropout, the dropped out neural network 

g satisfies the following. 

Also, $\mathbb{E}_{b_i, i \in [m]}[\hat{g}(x)] = g(x)$ for all $x \in \mathcal{X}$.


At a high level, the above theorem shows that if the estimation error $\|g - f\|_{\mathcal{D}}^2$ is large enough, then the following
holds with at least a constant probability: $\hat{g}$, which is a dropout based perturbation of $g$, has significantly smaller
estimation error than $g$ itself. Next, in Section 2.1, we apply our results to the problem of learning low-degree
polynomials and compare our guarantees with those of Andoni et al. [2014].


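Proof of Theorem 1. Let $e = g - f$ be the error polynomial for the approximation $g$ and let $\hat{e} = \hat{g} - f$ be the error
polynomial for the approximation $\hat{g}$. We have the following identity. Here $\Delta g = \hat{g} - g$, and the last step in (1) is for
notational purposes.

$$\begin{aligned}
\mathbb{E}_x\left[|\hat{g}(x) - f(x)|^2\right] &= \mathbb{E}_x\left[|g(x) + \Delta g(x) - f(x)|^2\right] \\
&= \mathbb{E}_x\left[|g(x) - f(x)|^2\right] + \mathbb{E}_x\left[|\Delta g(x)|^2\right] + 2\,\mathbb{E}_x\left[(g(x) - f(x))\,\Delta g(x)\right] \\
\mathbb{E}_x\left[|\hat{e}(x)|^2\right] - \mathbb{E}_x\left[|e(x)|^2\right] &= \mathbb{E}_x\left[|\Delta g(x)|^2\right] + 2\,\mathbb{E}_x\left[(g(x) - f(x))\,\Delta g(x)\right] \\
\|\hat{e}\|_{\mathcal{D}}^2 - \|e\|_{\mathcal{D}}^2 &= \underbrace{\|\Delta g\|_{\mathcal{D}}^2}_{A} + \underbrace{2\langle \Delta g, g - f \rangle_{\mathcal{D}}}_{B} \qquad (1)
\end{aligned}$$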


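We first analyze the term $B$. Notice that one can equivalently write the polynomial $\Delta g$ as $\sum_{i=1}^{m} \alpha_i \rho_i g_i$, where
$\rho_i \sim_{\text{unif}} \{-1, 1\}$. We have,

$$\mathbb{E}_{\rho}\Big[\sum_{i \in [m]} \alpha_i \rho_i \langle g_i, g - f \rangle_{\mathcal{D}}\Big] = 0. \qquad (2)$$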

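In order to lower bound $B$ in (1), along with (2) we now need to lower bound the variance of $B$, i.e., lower bound
the random variable $Z = \langle \Delta g, g - f \rangle_{\mathcal{D}}^2$. By the randomness of the $\rho_i$'s we have $\mathbb{E}[Z] = \sum_{i=1}^{m} \alpha_i^2 \langle g_i, g - f \rangle_{\mathcal{D}}^2$. Also,
we have $\mathbb{E}[Z^2] \le 2\big(\sum_{i} \alpha_i^2 \langle g_i, g - f \rangle_{\mathcal{D}}^2\big)^2$. Using the standard Paley-Zygmund anti-concentration inequality, which lower bounds
$\Pr[Z \ge \mathbb{E}[Z]/2]$ in terms of $(\mathbb{E}[Z])^2 / \mathbb{E}[Z^2]$, and plugging in the bounds on $Z$ from above, we have $\Pr[Z \ge \mathbb{E}[Z]/2] \ge 1/4$.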
E[Z 2 ] = E

fi fi 


= E 

M 


^2°nm{gi,g - f) 


V 


\ 2 — 1 


{aicijakai) ■ (muj/ikneiigug - f)v(gj,g - f)v{gk,g - f)v{gi,g - f)v 




< 2 E 


aiOij Hi/ij (gi, g f)v{gj-,g f)v 
2^a 2 a){g il g - f)x>{gj,g~ f)v 


hi 


< 


2 ( ^2»i(gi,g- f)l 


(3) 


In Q we lower bound E[Z] as follows. In ([5]» we have used the assumption that \\g\\j) > ||/|||,. Using the fact in ([2]) 

A 4 

and ([5]), it follows that with probability at least 1/4, we have 2(A g, g — f)x> > — II5 — /||p- 

m 

E i z ] = 'Yh a i( a igiig - f)v 

a. * ^ 


(4) 


> 


i=l 

^min 

m 


^2 \{(*igi,g- f)v I 


\i= 1 


> 


-(g,g-fYv = ^\\g-m 

m 4m 


(5) 


Now we focus on the term A in ([TJ and provide an upper bound. We have E [|| A^|||,] = Using the 

t 4 i=l 

l|a||| max \\gi\\^,s/rh 

bounds on the terms A and B one can conclude that if \\g — /|||, > - 1 i2/a ~~ -' l ^ cn w ' l h probability at least 

1/8, ||5 - /HI < (i - ypj) lb - f\\v- This completes the proof. □ 


2.1 Application: Learning polynomials with neural networks 


The work of Andoni et al. [2014] studied the problem of learning degree-$d$ polynomials (with real or complex
coefficients) using polynomial neural networks described above. In this section we provide a comparative analysis
of Andoni et al. [2014, Theorem 5.1] with Theorem 1 above. The approach of Andoni et al. [2014] is different
from our approach in two ways: i) for the analysis of Andoni et al. [2014] to go through, the perturbation has to be
complex, and ii) they consider additive perturbation to the weights as opposed to the multiplicative perturbation to
the nodes exhibited by dropout.

In order to make the results comparable, we will assume that for each node $i$, $\|g_i\|_{\mathcal{D}} = O(1)$ and $\alpha_i = O(1)$.
(Since Andoni et al. [2014] deal with complex numbers, these bounds are on the modulus.) Under this
assumption, Theorem 1 suggests that the error can be brought down to $O(m\sqrt{m})$, whereas Andoni et al. [2014,
Theorem 5.1] show that the error can be brought down to $O(m p^{d})$. Notice that our bound is independent of the
dimensionality ($p$) and the degree of the polynomial ($d$). In terms of the rate of convergence, Andoni et al. [2014,
Theorem 5.1] ensures that the error reduces by an $\Omega(1 - 1/m^{d})$ factor, while in our case it is $\Omega(1 - 1/\sqrt{m})$. Another
advantage of Theorem 1 is that it is oblivious to the data distribution $\mathcal{D}$, as opposed to the results of Andoni et al.
[2014], which explicitly require $\mathcal{D}$ to be either uniform or Gaussian.

Algorithm 1 Dropout gradient descent

Input: Data set: $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, loss function: $\ell$, learning rate: $\eta$, dropout probability: $\alpha$, $T$: Number
of iterations of SGD.

1: Choose initial starting model: $\theta_1$.

2: for $t \in [T]$ do

3: Sample $(x_t, y_t) \sim_{\text{iid}} D$, $b \sim_{\text{unif}} \{0, 1\}^p$.

4: $\theta_{t+1} \leftarrow \theta_t - \eta_t \nabla_{\theta} \ell(\langle \theta_t, b * x_t \rangle, y_t)$.

5: end for
6: Output: $\theta_{T+1}$.
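
As a rough illustration of the loop in Algorithm 1, the sketch below runs dropout SGD for the least squares loss. Several choices are assumptions rather than the paper's specification: the prediction is rescaled by 2 to match the dropout risk $\ell(2\langle x * b, \theta \rangle; y)$ used in Section 3, the step size is taken as `step / t`, there is no projection onto a constraint set, and the function name is hypothetical.

```python
import random

def dropout_sgd_least_squares(data, T, step, seed=0):
    """Dropout SGD for the loss (2<b*x, theta> - y)^2 with step size step/t (assumed schedule)."""
    rng = random.Random(seed)
    p = len(data[0][0])
    theta = [0.0] * p                                # initial model theta_1
    for t in range(1, T + 1):
        x, y = rng.choice(data)                      # sample one training point
        b = [rng.randint(0, 1) for _ in range(p)]    # dropout mask, each bit kept w.p. 1/2
        xb = [bi * xi for bi, xi in zip(b, x)]       # masked features b * x
        residual = 2.0 * sum(v * th for v, th in zip(xb, theta)) - y
        grad = [4.0 * residual * v for v in xb]      # gradient of the squared loss in theta
        theta = [th - (step / t) * gr for th, gr in zip(theta, grad)]
    return theta

# Toy data: two orthogonal features with targets +1 and -1.
toy = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
print(dropout_sgd_least_squares(toy, T=5000, step=0.25))
```

On one-dimensional data with $x = 1$ and $y = 1$, the dropout objective averages to $(1 - \theta)^2 + \theta^2$, so the iterates should settle near $\theta = 1/2$ rather than the unregularized solution $\theta = 1$.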


3 Fast rates of convergence for dropout 


In the previous section we saw how dropout helps one escape a local minimum encountered during gradient
descent. In this section, we show that for generalized linear models (GLMs) (a class of one layer convex neural
networks), dropout gradient descent provides an excess risk bound of $O(1/n)$, where $n$ is the number of training
data samples.

Problem Setting. We first describe the exact problem setting that we study. Let $\tau(\mathcal{D})$ be a fixed but unknown
distribution over the data domain $\mathcal{D} = \{(x, y) : x \in \mathcal{X}, y \in \mathcal{Y}\}$, where $\mathcal{X} \subseteq \mathbb{R}^p$ is the input feature domain and
$\mathcal{Y} \subseteq \mathbb{R}$ is the target output domain. Let the loss $\ell(\theta; x, y)$ be a real-valued convex function (in the first parameter)
defined over all $\theta \in \mathbb{R}^p$ and all $(x, y) \in \mathcal{D}$. The population and excess risk of a model $\theta$ are defined as:

$$\text{Risk}(\theta) = \mathbb{E}_{(x,y) \sim \tau(\mathcal{D})}\left[\ell(\theta; x, y)\right], \qquad \text{ExcessRisk}(\theta) = \text{Risk}(\theta) - \min_{\theta' \in \mathcal{C}} \text{Risk}(\theta'), \qquad (6)$$

where $\mathcal{C} \subseteq \mathbb{R}^p$ is a fixed convex set. A learning algorithm $\mathcal{A}$ typically has access to only a set of samples $D =
\{(x_1, y_1), \dots, (x_n, y_n)\}$, drawn i.i.d. from $\tau(\mathcal{D})$. The goal of the algorithm is to find $\theta$ with small excess risk.


Dropout Heuristic. We now describe the dropout based algorithm used to minimize the excess risk (see (6)). At a
high level, we just use the standard stochastic gradient descent algorithm. However, at each step, a random $\alpha$-fraction
of the coordinates of the parameter vector are updated. That is, the data point $x$ generated by the stochastic gradient
descent is perturbed to obtain $b * x$ for a random $b \in \{0,1\}^p$, where the $i$-th coordinate of $b * x$ is given by $b_i x_i$. Now, the perturbed
$b * x$ is used to update the parameter vector $\theta$. In this section, we assume that the sampling probability $\alpha = 1/2$. See
Algorithm 1 for the exact dropout algorithm that we analyze.


We also analyze a stylized variant of dropout that can be effectively captured by a standard regularized empirical
risk minimization setup (see Appendix A.1). Both of these analyses hinge on the observation that even though the
loss functions are not strongly convex in general, the dropout variants of these loss functions are strongly convex
in expectation and enable us to derive an excess risk of $O(1/n)$ in both cases. Recall that for non-strongly convex loss
functions in general, the lower bound on excess risk is $\Omega(1/\sqrt{n})$ Shalev-Shwartz et al. [2009].


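Assumption 2 (Data normalization). i) For any $(x, y) \in \mathcal{D}$, $\|x\|_2 \le B$, and ii) the loss function $\ell(u; y)$ is 1-strongly
convex in $u$ (i.e., $\frac{\partial^2 \ell(u; y)}{\partial u^2} \ge 1$) and $G$-Lipschitz (i.e., $\left|\frac{\partial \ell(u; y)}{\partial u}\right| \le G$).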


In Theorem 3 we provide the excess risk guarantee for the dropout heuristic.

Theorem 3 (Dropout generalization bound). Let C C he a fixed convex set and let Assumption [2] be true for the 
data domain T> and the loss £. Let D = {(xi, y 1 ), • • • , (. x n , y n )} be n i.i.d. samples drawn from t(T>). Let Risk(0) 
he defined as in Let the learning rate r)t = -^j. Then over the randomness of the SGD algorithm and the 
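distribution $\tau(\mathcal{D})$, we have excess risk

$$\mathbb{E}\Big[\mathbb{E}_{(x,y) \sim \tau(\mathcal{D}),\, b}\big[\ell(2\langle x * b, \hat{\theta} \rangle; y)\big]\Big] - \min_{\theta \in \mathcal{C}} \mathbb{E}_{(x,y) \sim \tau(\mathcal{D}),\, b}\big[\ell(2\langle x * b, \theta \rangle; y)\big] = O\left(\frac{(GB)^2 \log T}{\Lambda_1 T} + \frac{(GB)^2 \log n}{\Lambda n}\right).$$

Here $\Lambda = \min_{j \in [p]} \mathbb{E}_x\left[x(j)^2\right]$, $\Lambda_1 = \min_{j \in [p]} \frac{1}{n} \sum_{i=1}^{n} x_i(j)^2$, and $G$ and $B$ are defined in Assumption 2. The outer expectation
is over the randomness of the algorithm.
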

The proof of this theorem is provided in Section A.2. Observe that if $T \ge n^2$, and $\Lambda, \Lambda_1$ are assumed to be
constants, then the excess risk bound of Theorem 3 is $O(1/n)$. Second, note that the bound is for the dropout risk
defined in (13). For the special case of linear regression (see Lemma 4), the dropout-based risk is the true risk (6) plus
L2 regularization. Hence, in this case, using standard arguments Shalev-Shwartz et al. [2009] we get the $1/\sqrt{n}$
excess risk rate for the population risk defined in (6). However, for other loss functions, it is not clear how close the
dropout based risk is to the population risk.

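Lemma 4. Let $b$ be drawn uniformly from $\{0, 1\}^p$ and let $\ell(2\langle x * b, \theta \rangle; y)$ be the least squares loss function, i.e.,
$\ell(2\langle x * b, \theta \rangle; y) = (2\langle x * b, \theta \rangle - y)^2$. Then,

$$\text{Risk}(\theta) = \mathbb{E}_{(x,y) \sim \tau(\mathcal{D})}\left[(y - \langle x, \theta \rangle)^2\right] + \theta^T\, \mathbb{E}_{(x,y) \sim \tau(\mathcal{D})}\left[\text{diag}(x x^T)\right] \theta. \qquad (7)$$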


See Appendix A.2 for a detailed proof of Theorem 3 and Lemma 4.
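
The identity behind Lemma 4 can be checked numerically for a single data point. The sketch below averages the dropout least squares loss over every mask in $\{0,1\}^p$ and compares it with the squared error plus the diagonal penalty $\sum_j x(j)^2 \theta(j)^2$. The function names and the toy values are illustrative assumptions.

```python
from itertools import product

def dropout_sq_loss_mean(x, y, theta):
    """Exact average over b uniform on {0,1}^p of (2<b*x, theta> - y)^2."""
    p = len(x)
    total = 0.0
    for b in product((0, 1), repeat=p):
        pred = 2.0 * sum(bi * xi * ti for bi, xi, ti in zip(b, x, theta))
        total += (pred - y) ** 2
    return total / 2 ** p

def ridge_form(x, y, theta):
    """(y - <x, theta>)^2 + theta^T diag(x x^T) theta for one data point."""
    fit = (y - sum(xi * ti for xi, ti in zip(x, theta))) ** 2
    penalty = sum((xi * ti) ** 2 for xi, ti in zip(x, theta))
    return fit + penalty

x, y, theta = [1.0, -2.0, 0.5], 0.7, [0.3, 0.1, -0.4]
print(dropout_sq_loss_mean(x, y, theta), ridge_form(x, y, theta))
```

The two values agree because each rescaled mask entry $2 b_j$ has mean 1 and variance 1, so the variance of the masked prediction is exactly the diagonal penalty.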


Note. Notice that even when $\frac{1}{n}\sum_{i=1}^{n} x_i x_i^T$ is not full rank (e.g., all the $x_i$ are scaled versions of the $p$-dimensional vector
$(1, 1, \dots)$), we can still obtain an excess risk of $O(1/n)$ for the dropout loss. Recall that in general for non-strongly
convex loss functions, the best excess risk one can hope for is $O(1/\sqrt{n})$ Shalev-Shwartz et al. [2009].


3.1 Comparison to related work 


After the seminal paper of Hinton et al. [2012], demonstrating the strong experimental advantage of “dropping off”
nodes in the training of deep neural networks, there have been a series of works providing strong theoretical
understanding of the dropout heuristic Baldi and Sadowski [2014], Wager et al. [2013], Wang et al. [2013], van
Erven et al. [2014], Helmbold and Long [2014], Wager et al. [2014], McAllester [2013], Maaten et al. [2013]. A
high-level conclusion from all these works has been that dropout behaves as a regularizer, and in particular as an
L2 regularizer when the underlying optimization problem is convex. In terms of rates of convergence, the work of
Wager et al. [2013] provides asymptotic consistency for the dropout heuristic w.r.t. convex models. They show (using
a second order Taylor approximation) that asymptotically dropout behaves as an adaptive L2-regularizer. The work of
Wager et al. [2014] provides the precise rate of convergence of the excess risk when the data is assumed to come
from a Poisson generative model, and the underlying optimization task is topic modeling. For the classic problem
of linear optimization over a polytope, dropout recovers essentially the same bound as follow the perturbed leader
Kalai and Vempala [2005] while bypassing the issue of tuning the regularization parameter.

In this work we extend this line of work further by providing the precise (non-asymptotic) rate of convergence of the
dropout heuristic for arbitrary generalized linear models (GLMs). In essence, by providing this analysis, we close
the fourth open problem raised in the work of van Erven et al. [2014], which posed the problem of determining the
generalization error bound for GLMs. One surprising aspect of our result is that the rate of convergence is $O(1/n)$
(as opposed to $O(1/\sqrt{n})$), even when the underlying data covariance matrix $\frac{1}{n}\sum_{i} x_i x_i^T$ is not full-rank.


4 Private convex optimization using dropout


In this section we show that dropout can be used to design differentially private convex optimization algorithms.
In the last few years, the design of differentially private optimization (learning) algorithms has received significant
attention Chaudhuri and Monteleoni [2008], Dwork and Lei [2009], Chaudhuri et al. [2011], Jain et al. [2012],
Kifer et al. [2012], Duchi et al. [2013], Song et al. [2013], Jain and Thakurta [2014], Bassily et al. [2014]. We
further extend this line of research to show that dropout allows one to exploit properties of the data (e.g., minimum
entry in the diagonal of the Hessian) to ensure robustness, and hence differential privacy. Differential privacy is
a cryptographically strong notion which by now is the de-facto standard for statistical data privacy. It ensures the
privacy of individual entries in the data set even in the presence of arbitrary auxiliary information Dwork [2006,
2008].


Definition 5 ($(\epsilon, \delta)$-differential privacy Dwork et al. [2006b,a]). For any pair of neighboring data sets $D, D' \in \mathcal{D}^n$
differing in exactly one entry, an algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-differentially private if for all measurable sets $S$ in the range
space of $\mathcal{A}$ the following holds:

$$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \Pr[\mathcal{A}(D') \in S] + \delta. \qquad (8)$$

Here, think of $\delta = 1/n^{\omega(1)}$ and $\epsilon$ to be a small constant.


Background. At an intuitive level differential privacy ensures that the measure induced on the space of possible
outputs by a randomized algorithm $\mathcal{A}$ does not depend “too much” on the presence or absence of one data entry.
This intuition has two immediate consequences: i) If the underlying training data contains potentially sensitive
information (e.g., medical records), then it ensures that an adversary learns almost the same information about an
individual independent of his/her presence or absence in the data set, hence protecting his/her privacy. ii) Since
the output does not depend “too much” on any one data entry, the algorithm $\mathcal{A}$ cannot over-fit and hence will
provably have good generalization error. Formalizations of both these implications can be found in Dwork [2006]
and Bassily et al. [2014, Appendix F]. This property of a learning algorithm to not over-fit the training data (also
known as stability) is known to be both necessary and sufficient for a learning algorithm to generalize Shalev-Shwartz
et al. [2009, 2010], Poggio et al. [2011], Bousquet and Elisseeff [2002].


In the following we provide a stylized example where dropout ensures differential privacy. In Appendix B we
provide a detailed approach of extending this example to arbitrary generalized linear models (GLMs). (See Section
3 for a refresher on GLMs.)


4.1 Private dropout learning over the simplex 

In this section, we analyze a stylized example: linear loss functions over the simplex. The idea is to first show that
for a given data set $D$ (with a fixed set of properties which we will describe in Theorem 6), the dropout algorithm
satisfies the differential privacy condition in (8) for any data set $D'$ differing in one entry from $D$. (For the purposes
of brevity, we will refer to this as local differential privacy at $D$.) Later we will use a standard technique called the
propose-test-release (PTR) framework Dwork and Lei [2009] to convert the above into a differential privacy guarantee. (The
details are given in Algorithm 2.)

Let the data domain $\mathcal{X}$ be $\{0,1\}^p$ and let $X = \{x_1, \dots, x_n\}$ be i.i.d. samples from a distribution $\tau(\mathcal{X})$ over $\mathcal{X}$. Let
the loss function $\ell$ be $\ell(\langle \theta, x \rangle) = \langle \theta, x \rangle$ and let the constraint set $\mathcal{C}$ be the $p$-dimensional simplex. Let $b_1, \dots, b_n$ be
i.i.d. uniform samples from $\{0, 1\}^p$. The dropout optimization problem for the linear case can be defined as below.
Here the $i$-th coordinate of $b * x$ is given by $b_i x_i$.

$$\hat{\theta} = \arg\min_{\theta \in \mathcal{C}} \sum_{i=1}^{n} \langle x_i * b_i, \theta \rangle \qquad (9)$$
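
Algorithm 2 Private dropout learning over the simplex

Input: Data set: $D = \{x_1, \dots, x_n\} \subseteq \{0,1\}^p$, $\mathcal{C}$ = $p$-dimensional simplex, privacy parameters $\epsilon$ and $\delta$.

1: $\forall j \in [p]$, $c(j) \leftarrow \min_{k \in [n]} \frac{1}{n} \sum_{i=1, i \neq k}^{n} x_i(j)$ and $\Lambda \leftarrow \min_{j \in [p]} c(j)$.

2: $\hat{\Lambda} \leftarrow \Lambda + \text{Lap}(1/(n\epsilon))$ and $\forall i \in [n]$, $b_i \sim_{\text{unif}} \{0,1\}^p$.

3: If $\hat{\Lambda} > \frac{2\log(1/\delta)}{n\epsilon}$ then output $\hat{\theta} = \arg\min_{\theta \in \mathcal{C}} \sum_{i=1}^{n} \langle x_i * b_i, \theta \rangle$, else output failure.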
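
The three steps of Algorithm 2 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the test threshold $2\log(1/\delta)/(n\epsilon)$ is the one given in the propose-test-release discussion of Section 4.1.1, Laplace noise is drawn as the difference of two exponential variables, the minimizer over the simplex is returned as a coordinate index, and `None` stands for the failure output. The function name is hypothetical.

```python
import math
import random

def private_dropout_simplex(xs, eps, delta, seed=None):
    """Sketch of private dropout over the simplex for 0/1 data; returns a coordinate index or None."""
    rng = random.Random(seed)
    n, p = len(xs), len(xs[0])
    col_sums = [sum(x[j] for x in xs) for j in range(p)]
    # c(j): smallest leave-one-out column mean, obtained by dropping the largest entry of column j.
    c = [(col_sums[j] - max(x[j] for x in xs)) / n for j in range(p)]
    lam = min(c)
    # Laplace noise with scale 1/(n*eps), sampled as a difference of two exponentials.
    scale = 1.0 / (n * eps)
    lam_noisy = lam + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    if lam_noisy <= 2.0 * math.log(1.0 / delta) / (n * eps):
        return None                                   # test failed: output "failure"
    # A linear objective over the simplex is minimized at a vertex, i.e. a single coordinate.
    scores = [sum(x[j] * rng.randint(0, 1) for x in xs) for j in range(p)]
    return min(range(p), key=scores.__getitem__)

# Column 0 is non-zero in 500 of 2000 rows, column 1 in all rows.
data = [[1 if i < 500 else 0, 1] for i in range(2000)]
print(private_dropout_simplex(data, eps=0.5, delta=1e-3, seed=0))
```

When some coordinate is zero in every row, $\Lambda$ is 0 and the test fails except with very small probability, which is the behaviour the threshold is meant to enforce.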


At a high level, Theorem 6 states that changing any one data entry in the training data set $D$ changes the induced
probability measure on the set of possible outputs by a factor of $e^{\epsilon}$, and with an additive slack that is exponentially
small in the number of data samples ($n$).

Theorem 6 (Local differential privacy). Let $c(j) = \min_{k \in [n]} \frac{1}{n} \sum_{i=1, i \neq k}^{n} x_i(j)$ and $\Lambda = \min_{j \in [p]} c(j)$. For the given data set
$X$, if $\Lambda > 1/(n\epsilon)$ and $\epsilon < 1$, then the solution $\hat{\theta}$ of dropout ERM (9) is $(\epsilon, \delta)$-local differentially private at $X$, where
$\delta = p \exp\left(-\Omega(\epsilon^2 \Lambda n)\right)$.

Proof. First notice that since we are optimizing a linear function over the simplex, the minimizer $\hat{\theta}$ is essentially one
of the coordinates in $[p]$. Therefore one can equivalently write the optimization problem in (9) as follows. Here $x(j)$
refers to the $j$-th coordinate of the vector $x$.

$$\hat{j} = \arg\min_{j \in [p]} \sum_{i=1}^{n} x_i(j)\, b_i(j) \qquad (10)$$

W.l.o.g. we assume that the neighboring data set $D'$ differs from $D$ in $x_n$. Also, let $f(j, D) = \sum_{i=1}^{n} x_i(j) b_i(j)$. Clearly,
for any $j$, $|f(j, D) - f(j, D')| \le 1$. In the following we show that the measures induced on the random variable $\hat{j}$ by
(10) for data sets $D$ and $D'$ have multiplicative closeness. The analysis of this part closely relates to the differential
privacy guarantee under the binomial distribution from Dwork et al. [2006a].


For a given j G [p], let Vj be the number of non-zeroes in the j-th coordinate of all the xfs, excluding Therefore, 
we have the following for any k < u f. 


Pr[/(j,P) = (% + fc)] < V ~j + k+ 1 
Pr[f(j,D')=("i + k)] ~ v i-k 


So, as long as k < f , the ratio in ( fTT| ) is upper bounded by (1 + e) < e e for e < 1 . By Chernoff bound, such an 
event happens with probability at least exp(—fl(e 2 iz,j). For the lower tail of the binomial distribution, an analogous 
argument provides such a bound. 

One can use the same argument for other coordinates too. Notice $\Lambda = \frac{1}{n} \min_{j \in [p]} \nu_j$. By the union bound, for any given
$j \in [p]$, the ratio of the measures induced by $f(j, D)$ and $f(j, D')$ on any $k \in \mathbb{Z}$ which has probability measure at
least $1 - p \exp\left(-\Omega(\epsilon^2 \Lambda n)\right)$, is in $[e^{-\epsilon}, e^{\epsilon}]$.


In the following we notice that not only does each of the coordinates individually satisfy the multiplicative closeness in
measure, but in fact $\arg\min_{j \in [p]} f(j, D)$ and $\arg\min_{j \in [p]} f(j, D')$ satisfy analogous closeness in measure, where the closeness
is within $[e^{-2\epsilon}, e^{2\epsilon}]$. This property follows by using Bhaskar et al. [2010, Theorem 5]. This concludes the proof. □



















Figure 1: Stability analysis for logistic regression on Atheist data set. The experiments were repeated 20 times, the
means of which are plotted. (a) and (b) show, for random removal of training examples, how the error and marginal
error vary with p. (c) and (d) show the same for adversarial removal.



Figure 2: (a), (b): Stability of linear regression with Boston housing data set (random sub-sampling). (c), (d):
Stability analysis for DBN with MNIST data set (under random sub-sampling).


4.1.1 From local differential privacy to differential privacy 


Notice that Theorem 6 (ensuring local differential privacy) is independent of the data distribution $\tau(\mathcal{D})$. This has
direct implications for differential privacy. We show that using the propose-test-release (PTR) framework Dwork
and Lei [2009], Smith and Thakurta [2013], the dropout heuristic provides a differentially private algorithm.


Propose-test-release framework. Notice that for any pair of data sets $D$ and $D'$ differing in one entry, $\Lambda(D)$ and
$\Lambda(D')$ in Theorem 6 differ by at most $1/n$. So using the standard Laplace mechanism from differential privacy
Dwork et al. [2006b], one can show that $\hat{\Lambda} = \Lambda(D) + \text{Lap}(1/(n\epsilon))$ satisfies $(\epsilon, 0)$-differential privacy, where $\text{Lap}(\lambda)$
is a random variable sampled from the Laplace distribution with scaling parameter $\lambda$. With $\hat{\Lambda}$ in hand we check
if $\hat{\Lambda} > \frac{2\log(1/\delta)}{n\epsilon}$. If the condition is true, we output $\hat{\theta}$ from (9), and we output $\perp$ otherwise. Theorem 7 ensures
that the above PTR framework is $(\epsilon, \delta)$-differentially private.

Theorem 7. The propose-test-release framework along with dropout (i.e., Algorithm 2) is $(2\epsilon, \delta)$-differentially private
for optimizing linear functions over the simplex, where $\epsilon < 1$ and $\delta = 1/n^{\omega(1)}$.


This theorem is a direct consequence of Theorem 6 and Thakurta [2015]. Using the tail property of the Laplace distribution, one can show that as long as $\Lambda$ in Theorem 6 is at least $\frac{4\log(1/\delta)}{n\epsilon}$, w.p. at least $1 - \delta$ the above PTR framework outputs $\hat{\theta}$ from (9) exactly. While the current exposition of the PTR framework is tuned to the problem of optimizing linear functions over the simplex, a much more general treatment is provided in Appendix B.
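The tail property invoked here is the standard bound for a Laplace random variable. As a brief sketch (the constant $c > 0$ is a placeholder of ours, not a value taken from the text):

```latex
\Pr\big[\mathrm{Lap}(\lambda) \le -t\big] = \tfrac{1}{2}\, e^{-t/\lambda} \quad (t \ge 0),
\qquad
\lambda = \frac{1}{n\epsilon},\ \ t = \frac{c \log(1/\delta)}{n\epsilon}
\ \Longrightarrow\
\Pr\Big[\mathrm{Lap}\big(\tfrac{1}{n\epsilon}\big) \le -t\Big] = \tfrac{1}{2}\,\delta^{c} .
```

So whenever $\Lambda(D)$ exceeds the test threshold by a margin of order $\log(1/\delta)/(n\epsilon)$, the noisy estimate falls below the threshold only with probability polynomially small in $\delta$.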


5 Experiments 

In this section, we provide experimental evidence to support the stability guarantees we provided for dropout in Section 4 (for more extensive results, see Appendix D). We empirically measure stability by observing the effect on the performance of the learning algorithm as a function of the fraction of training examples removed. This measure captures how dependent an algorithm is on a particular subset of the training data. We show results for GLMs as well as for deep belief networks (DBNs). We compare against the following two baseline methods (wherever applicable): a) unregularized models and b) $L_2$-regularized GLMs. We describe our experimental setup and results for each of these model classes below.


Stability of dropout for logistic regression. We introduce perturbations of two forms: a) random removal of training examples and b) adversarial removal of training examples.

Random removal of training examples: For a given $p \in [0,1]$, we train a model on a randomly selected $(1 - p)$-fraction of the training data. We report the test error and the marginal error, defined as the absolute difference between the test error and the baseline error (the test error obtained by using the complete training dataset).
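The random-removal protocol can be sketched in a few lines of Python. This is only an illustration under our own assumptions: `train_fn` and `error_fn` are hypothetical callables standing in for whichever model and error measure is being studied, and the training set is a plain list of examples.

```python
import random

def marginal_error_curve(train_fn, error_fn, train, test, ps, seed=0):
    """For each removal fraction p, train on a random (1 - p)-fraction of `train`.

    Returns a list of (p, test_error, marginal_error) tuples, where the marginal
    error is |test_error - baseline| and the baseline uses the full training set.
    """
    rng = random.Random(seed)
    baseline = error_fn(train_fn(train), test)
    curve = []
    for p in ps:
        keep = max(1, round((1.0 - p) * len(train)))  # size of the retained subset
        subset = rng.sample(train, keep)               # sampled without replacement
        err = error_fn(train_fn(subset), test)
        curve.append((p, err, abs(err - baseline)))
    return curve
```

Averaging the returned curves over several seeds would mirror the repeated runs reported in the figures.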


We present results on the benchmark Atheist dataset from the 20 newsgroup corpus; total number of examples = 1427, dimensionality = 22178. We use 50% of the data for training and the remaining for testing, and use a dropout rate of 0.5. We measure the error in terms of the fraction of misclassified examples. Figure 1(a, b) shows the results for different values of $p$ when training a logistic regression model with no regularization, $L_2$ regularization, and two variants of dropout: “standard” dropout Hinton et al. [2012], and deterministic dropout outlined in Wang and Manning [2013].


We observe that the dropout variants exhibit more stability than the unregularized or the $L_2$-regularized versions. Moreover, deterministic dropout is more stable than standard dropout. Notice that, even though dropout consistently has a lower test error than the other methods, its effectiveness diminishes with increasing $p$. We hypothesize that with a decreasing amount of training data, the regularization provided by dropout also decreases (see Section 3 and Appendix A).

Adversarial removal of training examples: Let $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ be a given training set. Let $\theta_{\text{full}}$ be a model learned on the complete set $D$. For a given value of $p$, we remove the $np$ samples $x \in D$ which have minimum $|\langle x, \theta_{\text{full}} \rangle|$. The rest of the experiment remains the same as in the random removal setting. Figure 1(c, d) shows the test error and the marginal error for different regularization methods w.r.t. $p$ in this adversarial setting.
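The adversarial selection rule is easy to state in code. The following is a minimal sketch, assuming examples are stored as `(x, y)` pairs with `x` a tuple of floats; the function name `adversarial_subset` is ours.

```python
def adversarial_subset(train, theta_full, p):
    """Drop the round(n * p) examples with the smallest |<x, theta_full>|."""
    def margin(example):
        x, _ = example
        return abs(sum(xj * tj for xj, tj in zip(x, theta_full)))
    n_remove = min(len(train), round(p * len(train)))
    ranked = sorted(train, key=margin)   # smallest |<x, theta_full>| first
    return ranked[n_remove:]             # the examples that remain for training
```

The returned subset would then be passed to the same training and evaluation steps as in the random-removal setting.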


As with random removal, dropout continues to be at least as good as the other regularization methods studied. However, when $p > 0.5$, we observe that dropout’s advantage decreases very rapidly, and all the methods tend to perform similarly.


Stability of linear regression. Next, we apply our methods to linear regression using the Boston housing dataset Bache and Lichman [2013] (with 506 samples and 14 features). We use 300 examples for training and the rest for testing. Figure 2(a), (b) shows that the marginal error of dropout is less than that of the other methods for all values of $p$. Interestingly, for small values of $p$, dropout performs worse than $L_2$ regularization, although it performs better at higher values. Here we use a dropout rate of 0.05, and we measure the mean squared error.


Stability of deep belief networks: While our theoretical stability guarantees hold only for generalized linear models, our experiments indicate that they extend to deep belief networks (DBNs) too. We posit that the dropout algorithm on DBNs (after pre-training) operates in a locally convex region, where the stability properties should hold.

We use the MNIST data set for our DBN experiments. Experiments with other data sets are in Appendix D.2. The MNIST dataset contains 60000 examples for training and 10000 for testing. For training a DBN on this data set, we use a network with four layers.¹ We use 784, 800, 800, and 10 units in each layer respectively. Our error measure is the number of misclassifications.

As in the previous experiments, we measure stability by randomly removing training examples. See Figure 2(c), (d) for the test error and marginal error of dropout as well as the standard SGD algorithm applied to DBNs. Similar to the GLM setting, we observe that dropout exhibits more stability and accuracy than the unregularized SGD procedure. In fact, for the case of 50% training data, dropout is 16% more accurate than SGD.

¹ We use the gdbn and nolearn Python toolkits for training a DBN.


A Fast rates of convergence for dropout optimization

A.1 Empirical risk minimization (ERM) formulation of dropout


For simplicity of exposition, we modify the ERM formulation to incorporate dropout perturbation as a part of the optimization problem itself. We stress that the ERM formulation is for intuition only. In Section 3, we analyze the stochastic gradient descent (SGD) variant of the dropout heuristic and show that the excess risk bound for the SGD variant is similar to that of the ERM variant.

Given a loss function $\ell$, convex set $\mathcal{C}$, and data set $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ which consists of $n$ i.i.d. samples drawn from $\tau(\mathcal{D})$, fitting a model with dropout corresponds to the following optimization:

$$\hat{\theta} = \arg\min_{\theta \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^{n} \ell(2\langle x_i * b_i, \theta \rangle; y_i), \tag{12}$$

where each $b_i$ is an i.i.d. sample drawn uniformly from $\{0, 1\}^p$, and the operator $*$ refers to the Hadamard product. We assume that the loss function $\ell(u, y) : \mathbb{R}^2 \to \mathbb{R}$ is strongly convex in $u$. For example, in the case of least-squares linear regression the loss function $\ell(2\langle x, \theta \rangle; y)$ is $(y - 2\langle x, \theta \rangle)^2$.
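As a minimal sketch of objective (12) in the least-squares case, the following Python evaluates the dropout ERM objective for a given set of masks. The function names `dropout_erm_objective` and `sample_masks` and the list-based data layout are illustrative assumptions of ours, not part of the original formulation.

```python
import random

def dropout_erm_objective(theta, data, masks):
    """(1/n) * sum_i (y_i - 2 <x_i * b_i, theta>)^2, with * the Hadamard product."""
    total = 0.0
    for (x, y), b in zip(data, masks):
        u = 2.0 * sum(xj * bj * tj for xj, bj, tj in zip(x, b, theta))
        total += (y - u) ** 2
    return total / len(data)

def sample_masks(n, p, seed=0):
    """Draw n i.i.d. masks uniformly from {0, 1}^p."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(p)] for _ in range(n)]
```

Minimizing this function over `theta` in the convex set, for one fixed draw of masks, corresponds to the estimator in (12).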

Lemma 8. Let $b$ be drawn uniformly from $\{0, 1\}^p$ and let the expected population risk be given by

$$\mathrm{Risk}(\theta) = \mathbb{E}_{(x,y) \sim \tau(\mathcal{D}),\, b}\left[\ell(2\langle x * b, \theta \rangle; y)\right]. \tag{13}$$

Let $\ell(u, y)$ be an $\alpha$-strongly convex function w.r.t. $u$. Then, the expected population risk is $\alpha \Lambda$-strongly convex w.r.t. $\theta$, where $\Lambda = \min_{j \in [p]} \mathbb{E}_{x \sim \tau(\mathcal{D})}[x(j)^2]$ and $x(j)$ is the $j$-th coordinate of $x$.

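Proof. Now,

$$\nabla^2_\theta \mathrm{Risk}(\theta) = \mathbb{E}_{(x,y) \sim \tau(\mathcal{D})}\, \mathbb{E}_{b}\left[\frac{\partial^2 \ell(2\langle x * b, \theta \rangle; y)}{\partial \langle 2x * b, \theta \rangle^2} \cdot 4 (x * b)(x * b)^T\right] \succeq 4\alpha\, \mathbb{E}_{(x,y) \sim \tau(\mathcal{D})}\left[\frac{1}{4}\mathrm{diag}(xx^T) + \frac{1}{4}xx^T\right] \succeq \alpha\Lambda,$$

where the second-to-last inequality follows by strong convexity of $\ell$ and from the fact that $b$ is sampled uniformly from $\{0,1\}^p$. □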
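The factors of $\frac{1}{4}$ in the middle step come from the second moments of the dropout mask. As a short worked step (a sketch; $b(i)$ denotes the $i$-th coordinate of $b$, a notation we introduce here by analogy with $x(j)$):

```latex
\mathbb{E}_b[b(i)\,b(j)] =
\begin{cases} \tfrac{1}{2}, & i = j,\\ \tfrac{1}{4}, & i \neq j, \end{cases}
\qquad\Longrightarrow\qquad
\mathbb{E}_b\big[(x * b)(x * b)^T\big] = \tfrac{1}{4}\,\mathrm{diag}(xx^T) + \tfrac{1}{4}\,xx^T .
```

Since $xx^T$ is positive semidefinite, dropping it leaves $\alpha\,\mathbb{E}[\mathrm{diag}(xx^T)]$, whose smallest diagonal entry is $\alpha\Lambda$.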

An immediate corollary to the above lemma is that for normalized features, i.e., $\mathbb{E}[x(j)^2] = 1$, the dropout risk function (13) is the same as that for $L_2$-regularized least squares (in expectation).

Corollary 9. Let $b$ be drawn uniformly from $\{0, 1\}^p$ and let $\ell(2\langle x * b, \theta \rangle; y)$ be the least squares loss function, i.e., $\ell(2\langle x * b, \theta \rangle; y) = (2\langle x * b, \theta \rangle - y)^2$. Then,

$$\mathrm{Risk}(\theta) = \mathbb{E}_{(x,y) \sim \tau(\mathcal{D})}\left[(y - \langle x, \theta \rangle)^2\right] + \theta^T\, \mathbb{E}_{(x,y) \sim \tau(\mathcal{D})}\left[\mathrm{diag}(xx^T)\right] \theta. \tag{14}$$
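The identity in Corollary 9 can be checked exactly for a single pair $(x, y)$ by enumerating all $2^p$ masks. The sketch below does this in plain Python; the function names are ours, and the check covers only the inner expectation over $b$, not the expectation over the data distribution.

```python
from itertools import product

def expected_dropout_sq_loss(x, y, theta):
    """Exact E_b[(2 <x*b, theta> - y)^2] for b uniform on {0,1}^p."""
    p = len(x)
    total = 0.0
    for b in product((0, 1), repeat=p):
        u = 2.0 * sum(xj * bj * tj for xj, bj, tj in zip(x, b, theta))
        total += (u - y) ** 2
    return total / 2 ** p

def ridge_like_form(x, y, theta):
    """(y - <x, theta>)^2 + theta^T diag(x x^T) theta."""
    fit = (y - sum(xj * tj for xj, tj in zip(x, theta))) ** 2
    penalty = sum((xj * tj) ** 2 for xj, tj in zip(x, theta))
    return fit + penalty
```

The two functions agree up to floating-point error for any inputs, which is the per-sample version of (14): the dropout term acts as a data-dependent quadratic penalty.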


Next, we provide an excess risk bound for $\hat{\theta}$, the optimal solution to the dropout-based ERM (12). Our proof technique closely follows that of Sridharan et al. [2008] and crucially uses the fact that Sridharan et al. [2008] only requires strong convexity of the expected loss function. Below we provide the risk bound.


Theorem 10 (Dropout generalization bound). Let $\mathcal{C} \subseteq \mathbb{R}^p$ be a fixed convex set and let Assumption 2 be true for the data domain $\mathcal{D}$ and the loss $\ell$. Let $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ be $n$ i.i.d. samples drawn from $\tau(\mathcal{D})$. Let $H = \{b_1, \cdots, b_n\}$ be $n$ i.i.d. vectors drawn uniformly from $\{0,1\}^p$. Let $\hat{\theta} = \arg\min_{\theta \in \mathcal{C}} \frac{1}{n} \sum_{i=1}^{n} \ell(2\langle x_i * b_i, \theta \rangle; y_i)$ and let $\mathrm{Risk}(\theta)$ be defined as in (13). Then, w.p. $\geq 1 - \gamma$ (over the randomness of both $D$ and $H$), we have the following:

$$\mathbb{E}\left[\mathrm{Risk}(\hat{\theta}) - \min_{\theta \in \mathcal{C}} \mathrm{Risk}(\theta)\right] = O\left(\frac{(GB)^2 \log(1/\gamma)}{\Lambda n}\right).$$

Here $\Lambda = \min_{j \in [p]} \mathbb{E}_x[x(j)^2]$, and the parameters $G$ and $B$ are defined in Assumption 2.

Proof. Define $g_\theta$ as: $g_\theta(x, y, b) = \ell(2\langle x * b, \theta \rangle; y) - \ell(2\langle x * b, \theta^* \rangle; y)$, where $\theta^* = \arg\min_{\theta \in \mathcal{C}} \mathbb{E}_{(x,b,y)}[\ell(2\langle x * b, \theta \rangle; y)]$. Also, let $\mathcal{G}_\theta = \{g_\theta : \theta \in \mathcal{C}\}$. Following the technique of Sridharan et al. [2008], we will now scale each of the $g_\theta$'s such that the ones which have higher expected value over $(x, y, b)$ have exponentially smaller weight. This helps us obtain a more fine-grained bound on the Rademacher complexity, which will be apparent below.

Let $\mathcal{G}_a = \left\{ g^a_\theta = \frac{g_\theta}{4^{k_a(\theta)}} : \theta \in \mathcal{C},\ k_a(\theta) = \min\left\{k' \in \mathbb{Z}_+ : \mathbb{E}_{(x,y,b)}[g_\theta] \leq a 4^{k'}\right\} \right\}$. Using standard Rademacher complexity bounds Bousquet et al. [2004, Theorem 5], for any $\theta \in \mathcal{C}$, the following holds (w.p. $\geq 1 - \gamma$ over the randomness in the selection of the data set $H$):

$$\sup_{q \in \mathcal{G}_a} \left[ \mathbb{E}_{(x,y,b)}[q(x, y, b)] - \frac{1}{n} \sum_{i=1}^{n} q(x_i, y_i, b_i) \right] \leq 2\mathcal{R}(\mathcal{G}_a) + \left( \sup_{q \in \mathcal{G}_a, (x,y,b)} |q(x, y, b)| \right) \sqrt{\frac{\log(1/\gamma)}{2n}}. \tag{15}$$

Here $\mathcal{R}$ refers to the Rademacher complexity of the hypothesis class. In the following we bound each of the terms on the right-hand side of (15).

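Lemma 11. Let $\Lambda = \min_{j \in [p]} \mathbb{E}_x[x(j)^2]$, where $x(j)$ refers to the $j$-th coordinate of $x$. We claim that

$$\sup_{q \in \mathcal{G}_a, (x,y,b)} |q(x, y, b)| \leq 2(GB)\sqrt{\frac{2a}{\Lambda}}.$$

Proof. By the definition of the bound on the domain of $x$ and the assumption on $\ell$, we have $\|\nabla_\theta \ell(2\langle x * b, \theta \rangle; y)\|_2 \leq 2GB$. Therefore, $\forall \theta \in \mathcal{C}$, $q \in \mathcal{G}_a$, $(x, y) \in \mathcal{D}$, $b \in \{0, 1\}^p$, we have the following.

$$|q(x, y, b)| = \frac{|g_\theta(x, y, b)|}{4^{k_a(\theta)}} \leq \frac{2GB\,\|\theta - \theta^*\|_2}{4^{k_a(\theta)}}. \tag{16}$$

We now bound $\|\theta - \theta^*\|_2$. Using Lemma 8, $\mathbb{E}_{x,y,b}[\ell(2\langle x * b, \theta \rangle; y)]$ is strongly convex, and hence using the optimality of $\theta^*$, we have:

$$\|\theta - \theta^*\|_2 \leq \sqrt{\frac{2}{\Lambda}\, \mathbb{E}_{x,y,b}[g_\theta(x, y, b)]} \leq \sqrt{\frac{4^{k_a(\theta)} \cdot 2a}{\Lambda}}, \tag{17}$$

where the last inequality follows from the definition of $k_a(\theta)$. Substituting (17) into (16) gives the claim. □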


Now, directly using Sridharan et al. [2008, Lemma 7], we can bound the Rademacher complexity in (15) by $\mathcal{R}(\mathcal{G}_a) \leq 4GB\sqrt{\frac{2a}{\Lambda n}}$. Therefore we can bound (15) as follows.

$$\sup_{q \in \mathcal{G}_a} \left[ \mathbb{E}_{(x,y,b)}[q(x, y, b)] - \frac{1}{n} \sum_{i=1}^{n} q(x_i, y_i, b_i) \right] = O\left(GB\sqrt{\frac{a \log(1/\gamma)}{\Lambda n}}\right). \tag{18}$$

Now notice that for any $a > 0$, w.p. at least $1 - \gamma$ we have the following from (18).

$$\mathbb{E}_{x,y,b}[g_\theta(x, y, b)] - \frac{1}{n}\sum_{i=1}^{n} g_\theta(x_i, y_i, b_i) = 4^{k_a(\theta)} \left[ \mathbb{E}_{x,y,b}[g^a_\theta(x, y, b)] - \frac{1}{n}\sum_{i=1}^{n} g^a_\theta(x_i, y_i, b_i) \right] \leq 4^{k_a(\theta)} M. \tag{19}$$

Here $M = \xi\, GB\sqrt{\frac{a \log(1/\gamma)}{\Lambda n}}$ for some constant $\xi > 0$.

When $k_a(\theta) = 0$, (19) implies the following.

$$\mathbb{E}_{x,y,b}[g_\theta(x, y, b)] - \frac{1}{n}\sum_{i=1}^{n} g_\theta(x_i, y_i, b_i) = O\left(GB\sqrt{\frac{a \log(1/\gamma)}{\Lambda n}}\right). \tag{20}$$

When $k_a(\theta) > 0$, one has $4^{k_a(\theta) - 1} a < \mathbb{E}_{x,y,b}[g_\theta(x, y, b)]$. Substituting this in (19) and rearranging the terms, we have the following.

$$\mathbb{E}_{x,y,b}[g_\theta(x, y, b)] \leq \frac{1}{1 - 4M/a} \left[ \frac{1}{n}\sum_{i=1}^{n} g_\theta(x_i, y_i, b_i) \right]. \tag{21}$$

Setting $a = 8M$, and combining (21) and (20) for the cases $k_a(\theta) = 0$ and $k_a(\theta) > 0$ completes the proof. □


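A.2 Proof of Theorem 3 (Generalization bound for dropout gradient descent)


Proof. Let

$$J(\theta; D) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{b \sim \{0,1\}^p}\left[\ell(2\langle x_i * b, \theta \rangle; y_i)\right]. \tag{22}$$

Let $\ell_t(\theta) = \ell(2\langle x_t * b_t, \theta \rangle; y_t)$ for ease of notation, where $x_t$, $y_t$ and $b_t$ are the parameters used in the $t$-th iterate. Over the randomness of the SGD algorithm, we have the following:

$$\mathbb{E}[\nabla \ell_t(\theta)] = \nabla J(\theta; D). \tag{23}$$

Additionally, we have the following.

$$\nabla^2 J(\theta; D) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{b}\left[\frac{\partial^2 \ell(2\langle x_i * b, \theta \rangle; y_i)}{\partial \langle 2x_i * b, \theta \rangle^2} \cdot 4 (x_i * b)(x_i * b)^T\right] \succeq \frac{4\alpha}{n} \sum_{i=1}^{n} \left[\frac{1}{4}\mathrm{diag}(x_i x_i^T) + \frac{1}{4} x_i x_i^T\right] \succeq \Lambda_1. \tag{24}$$

(24) implies that $J(\theta; D)$ is $\Lambda_1$-strongly convex. Using Theorem 12 we obtain the following:

$$\mathbb{E}\left[J(\theta_T; D) - \min_{\theta \in \mathcal{C}} J(\theta; D)\right] = O\left(\frac{L^2 \log T}{\Lambda_1 T}\right), \tag{25}$$

where $L^2$ is the bound on $\mathbb{E}[\|\nabla \ell_t(\theta_t)\|_2^2]$ from Theorem 12 and the expectation is over the randomness of the SGD algorithm. Notice that by the definition of $J(\theta; D)$ in (22), we have

$$\mathrm{Risk}(\theta) = \mathbb{E}_{D \sim \tau(\mathcal{D})^n}\left[J(\theta; D)\right].$$

The theorem now follows by using (25) and Theorem 13. □
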
Theorem 12 (Convergence of expected stochastic gradient descent, Shamir and Zhang [2013]). Suppose $J(\theta)$ is a $\Lambda_1$-strongly convex function, and let the stochastic gradient descent algorithm be $\theta_{t+1} = \Pi_{\mathcal{C}}(\theta_t - \eta_t \nabla \ell_t(\theta_t))$ such that $\mathbb{E}[\nabla \ell_t(\theta)] = \nabla J(\theta)$ for all $\theta \in \mathcal{C}$. Additionally, assume that $\mathbb{E}[\|\nabla \ell_t(\theta_t)\|_2^2] = O(L^2)$ for all $t$. If the learning rate $\eta_t = 1/(\Lambda_1 t)$, then for any $T > 1$ the following is true.

$$\mathbb{E}\left[J(\theta_T) - \min_{\theta \in \mathcal{C}} J(\theta)\right] = O\left(\frac{L^2 \log T}{\Lambda_1 T}\right).$$

All the expectations are over the randomness of the stochastic gradient descent algorithm.
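The update rule in Theorem 12, specialized to the dropout loss $\ell(2\langle x * b, \theta \rangle; y)$, can be sketched as below. This is an illustrative implementation under our own assumptions: `grad_u` (the derivative of the loss in its first argument), `project` (the projection $\Pi_{\mathcal{C}}$) and `lam1` (the strong convexity parameter $\Lambda_1$) are hypothetical arguments supplied by the caller.

```python
import random

def dropout_sgd(data, grad_u, project, lam1, T, seed=0):
    """Projected SGD on l(2<x*b, theta>; y) with step size 1 / (lam1 * t)."""
    rng = random.Random(seed)
    p = len(data[0][0])
    theta = project([0.0] * p)
    for t in range(1, T + 1):
        x, y = rng.choice(data)                       # sample one training point
        b = [rng.randint(0, 1) for _ in range(p)]     # b uniform on {0,1}^p
        z = [2.0 * xj * bj for xj, bj in zip(x, b)]   # the vector 2 (x * b)
        u = sum(zj * tj for zj, tj in zip(z, theta))  # u = 2 <x * b, theta>
        g = grad_u(u, y)                              # dl/du evaluated at u
        eta = 1.0 / (lam1 * t)                        # learning rate from Theorem 12
        theta = project([tj - eta * g * zj for tj, zj in zip(theta, z)])
    return theta
```

For the squared loss, `grad_u` would be `lambda u, y: 2.0 * (u - y)`, and `project` could clip each coordinate to a box when $\mathcal{C}$ is a box; both are choices made for illustration only.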


Now in the following we state a variant of Sridharan et al. [2008, Theorem 1]. The only difference is that we use the strong convexity of $\mathrm{Risk}(\theta)$ in the guarantee instead of the strong convexity of $J(\theta; D)$. The proof of this variant is exactly the same as that of Sridharan et al. [2008, Theorem 1].

Theorem 13 (Fast convergence for strongly convex objective, Sridharan et al. [2008]). Let $D \sim \tau(\mathcal{D})^n$. Over the randomness of the data distribution $\tau(\mathcal{D})$, for all $\theta \in \mathcal{C}$, the following is true w.p. $\geq 1 - \gamma$.

$$\mathrm{Risk}(\theta) - \min_{\theta \in \mathcal{C}} \mathrm{Risk}(\theta) = O\left(J(\theta; D) - \min_{\theta \in \mathcal{C}} J(\theta; D) + \frac{(GB)^2 \log(1/\gamma)}{\Lambda n}\right).$$

Here $\Lambda = \min_{j \in [p]} \mathbb{E}_x[x(j)^2]$.


B Differentially private learning for GLMs using dropout 


To generalize our stability result for linear losses to the generic GLM regression setting, we first provide a model stability result for the dropout-based gradient descent. At a high level, Theorem 14 ensures that if the dropout gradient descent (from Section 3) is executed for $T = n^2$ iterations, then changing any one data entry in the training data set changes the model by at most $1/n$ (in the $L_2$-norm).

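Theorem 14 (Model stability of dropout gradient descent). Let $\mathcal{C} \subseteq \mathbb{R}^p$ be a fixed convex set and let the data domain $\mathcal{D}$ and loss $\ell$ satisfy Assumption 2 with parameters $G$, $B$. Let $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ be $n$ samples from $\mathcal{D}$, let $\Lambda = \frac{1}{n} \min_{k \in [n]} \min_{j \in [p]} \sum_{i=1, i \neq k}^{n} x_i(j)^2$, let the learning rate $\eta_t$ be that of the dropout gradient descent in Section 3, and let $T$ be the number of time steps for which gradient descent is executed. Then gradient descent with dropout ensures the following property for any data set $D'$ differing in one entry from $D$.

$$\mathbb{E}\left[\|\mathcal{A}(D) - \mathcal{A}(D')\|_2\right] = O\left(\frac{GB}{\Lambda}\left(\sqrt{\frac{\log T \max\{\Lambda_1/\Lambda, \Lambda/\Lambda_1\}}{T}} + \frac{1}{n}\right)\right).$$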

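Proof. We will follow the notation from Section 3 for convenience. Recall that $J(\theta; D) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{b \sim \{0,1\}^p}[\ell(2\langle x_i * b, \theta \rangle; y_i)]$ and $\ell_t(\theta) = \ell(2\langle x_t * b_t, \theta \rangle; y_t)$, where $x_t$, $y_t$ and $b_t$ are the parameters used in the $t$-th iterate. By the same argument as in (24) we have that $J(\theta; D)$ and $J(\theta; D')$ are $\Lambda$-strongly convex, where $D'$ is any neighboring data set of $D$.

Let $\theta_T(D)$ and $\theta_T(D')$ be the outputs of SGD on data sets $D$ and $D'$. Similarly, let $\theta^\dagger(D) = \arg\min_{\theta \in \mathcal{C}} J(\theta; D)$ and let $\theta^\dagger(D')$ be defined analogously. Using an immediate variant of Theorem 12 we conclude that for $\theta_T$ (i.e., the $\theta$ obtained after running the SGD algorithm for $T$ iterations), $\mathbb{E}[J(\theta_T(D); D) - J(\theta^\dagger(D); D)] = O\left(\frac{(GB)^2 \log T \max\{\Lambda_1/\Lambda, \Lambda/\Lambda_1\}}{\Lambda T}\right)$. By the $\Lambda$-strong convexity of $J(\theta; D)$ and Jensen’s inequality, we have the following:

$$\mathbb{E}\left[\|\theta_T(D) - \theta^\dagger(D)\|_2\right] = O\left(\frac{GB}{\Lambda}\sqrt{\frac{\log T \max\{\Lambda_1/\Lambda, \Lambda/\Lambda_1\}}{T}}\right). \tag{26}$$

The same bound holds for the data set $D'$ too. In order to complete the stability argument, we show that on a neighboring data set $D'$, $\theta^\dagger(D)$ does not change too much in $L_2$-norm. W.l.o.g. assume that $D$ and $D'$ differ in the $n$-th data entry. Therefore, by strong convexity and the property of the minimizers $\theta^\dagger(D)$ and $\theta^\dagger(D')$, we have the following. For brevity we represent $f(\theta; x, y) = \frac{1}{n} \mathbb{E}_{b \sim \{0,1\}^p}[\ell(2\langle x * b, \theta \rangle; y)]$.

$$\begin{aligned}
& J(\theta^\dagger(D'); D) \geq J(\theta^\dagger(D); D) + \frac{\Lambda}{2}\|\theta^\dagger(D') - \theta^\dagger(D)\|_2^2 \\
\Leftrightarrow\ & J(\theta^\dagger(D'); D) + f(\theta^\dagger(D'); x'_n, y'_n) - \left(J(\theta^\dagger(D); D) + f(\theta^\dagger(D); x'_n, y'_n)\right) \geq \frac{\Lambda}{2}\|\theta^\dagger(D') - \theta^\dagger(D)\|_2^2 + \left(f(\theta^\dagger(D'); x'_n, y'_n) - f(\theta^\dagger(D); x'_n, y'_n)\right) \\
\Leftrightarrow\ & J(\theta^\dagger(D'); D') - J(\theta^\dagger(D); D') \geq \frac{\Lambda}{2}\|\theta^\dagger(D') - \theta^\dagger(D)\|_2^2 + \left(f(\theta^\dagger(D'); x'_n, y'_n) - f(\theta^\dagger(D); x'_n, y'_n)\right) \\
\Rightarrow\ & f(\theta^\dagger(D); x'_n, y'_n) - f(\theta^\dagger(D'); x'_n, y'_n) \geq \frac{\Lambda}{2}\|\theta^\dagger(D') - \theta^\dagger(D)\|_2^2 \\
\Rightarrow\ & \|\theta^\dagger(D') - \theta^\dagger(D)\|_2 = O\left(\frac{GB}{\Lambda n}\right).
\end{aligned} \tag{27}$$

Now notice that Theorem 12 is true for any data set. Therefore combining (27) and (26), we have

$$\mathbb{E}\left[\|\theta_T(D) - \theta_T(D')\|_2\right] = O\left(\frac{GB}{\Lambda}\left(\sqrt{\frac{\log T \max\{\Lambda_1/\Lambda, \Lambda/\Lambda_1\}}{T}} + \frac{1}{n}\right)\right).$$

This completes the proof. □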


Here we want to highlight two interesting properties of this theorem: i) the stability guarantee holds even if all the data points (the $x_i$'s) are aligned in the same direction, i.e., when the Hessian is rank deficient, and ii) the rate of stability is similar to the one that can be obtained by adding an $L_2$-regularizer Chaudhuri et al. [2011], Kifer et al. [2012].

Unlike in Section 4.1, we are only able to show model stability for GLMs. One can see that local differential privacy cannot be achieved in general for this setting because the support of the distribution over the output model can be different for neighboring data sets and the randomness is multiplicative. However, by modifying the gradient descent algorithm in Section 3 to add random perturbation, one can achieve local differential privacy. Specifically, we use the Gaussian mechanism from the differential privacy literature (Nikolov et al. [2013, Lemma 4] and Appendix B.2). But in order to add the Gaussian perturbation, we need Theorem 14 to hold w.h.p. We use a boosting scheme to achieve this.


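Boosting of stability. Let $\theta^{(1)}, \cdots, \theta^{(k)}$ be the models from $k$ independent runs of the dropout gradient descent algorithm (see Section 3). Let $j^* = \arg\min_{j \in [k]} J(\theta^{(j)}; D)$, where $J(\theta; D) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}_{b \sim \{0,1\}^p}[\ell(2\langle x_i * b, \theta \rangle; y_i)]$. If $k = \log(1/\delta)$ and $\epsilon_{\text{mod}}$ equals the bound in Theorem 14, then w.p. $\geq 1 - \delta$, we have: $\|\theta^{(j^*)}(D) - \theta^{(j^*)}(D')\|_2 \leq \epsilon_{\text{mod}}\sqrt{\log(1/\delta)}$. Formally,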

Theorem 15 (High probability model stability of dropout gradient descent). Let $k = \log(1/\delta)$ be the number of independent runs of stochastic gradient descent. Following the notation of Theorem 14, with probability at least $1 - \delta$ over the randomness of the algorithm, we have the following.

$$\|\theta^{(j^*)}(D) - \theta^{(j^*)}(D')\|_2 = O\left(\frac{GB\sqrt{\log(1/\delta)}}{\Lambda}\left(\sqrt{\frac{\log T \max\{\Lambda_1/\Lambda, \Lambda/\Lambda_1\}}{T}} + \frac{1}{n}\right)\right).$$

Here $D'$ is any neighboring data set of $D$.


Proof. Let $\theta^\dagger = \arg\min_{\theta \in \mathcal{C}} J(\theta; D)$. Notice that by Markov’s inequality and the bound from Theorem 12, one can conclude that w.p. at least $1/2$, for a given $j \in [k]$, $J(\theta^{(j)}; D) - J(\theta^\dagger; D) = O\left(\frac{(GB)^2 \log T \max\{\Lambda_1/\Lambda, \Lambda/\Lambda_1\}}{\Lambda T}\right)$. For $k = \log(1/\delta)$ independent runs of the stochastic gradient descent, it follows that with probability at least $1 - \delta$, $J(\theta^{(j^*)}; D) - J(\theta^\dagger; D) = O\left(\frac{(GB)^2 \log T \log(1/\delta) \max\{\Lambda_1/\Lambda, \Lambda/\Lambda_1\}}{\Lambda T}\right)$.

Now following the rest of the argument exactly as in Theorem 14, we get the required bound. □
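The boosting step itself is a simple selection over independent runs. The sketch below assumes a hypothetical `run_fn(seed)` that performs one run of dropout gradient descent and an `objective` that evaluates $J(\theta; D)$; rounding $k$ up and using the natural logarithm are our choices.

```python
import math

def boosted_run(run_fn, objective, delta, seed=0):
    """Run k = ceil(log(1/delta)) independent trainings; return the best by objective."""
    k = max(1, math.ceil(math.log(1.0 / delta)))
    models = [run_fn(seed + j) for j in range(k)]   # independent runs via distinct seeds
    j_star = min(range(k), key=lambda j: objective(models[j]))
    return models[j_star], j_star, k
```

Selecting the run with the smallest objective is what turns the expectation bound of Theorem 14 into a statement that holds with high probability.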


Now, using standard results from the differential privacy literature Nikolov et al. [2013], we can show that $\tilde{\theta}_T = \theta^{(j^*)} + \mathcal{N}(0, \sigma^2\, \mathbb{I}_p)$, with noise variance $\sigma^2$ of order $\epsilon_{\text{mod}}^2 \log(1/\delta)/\epsilon^2$ (up to logarithmic factors), is $(\epsilon, \delta)$-locally differentially private at the data set $D$. Furthermore, using the fact that $\mathrm{Risk}(\theta)$ is $O(GB)$-Lipschitz and the results of Jain and Thakurta [2014, Theorem 1] and Chaudhuri et al. [2011, Theorem 3], we have:

1. Improper learning bound: $\mathbb{E}[|\mathrm{Risk}(\tilde{\theta}_T) - \mathrm{Risk}(\theta^{(j^*)})|] = GB\sigma$.

2. Proper learning bound: $\mathbb{E}[|\mathrm{Risk}(\Pi_{\mathcal{C}}(\tilde{\theta}_T)) - \mathrm{Risk}(\theta^{(j^*)})|] = GB\sqrt{p}\,\sigma$.

Notice that the improper learning bound has no explicit dependence on $p$. Also, if $T = n^2$ and if we consider $\Lambda$ and all parameters except $n$ to be constants, then the above bounds indicate that the excess error due to the Gaussian noise addition is $O(1/n)$, which is similar to the excess risk bound of the standard dropout algorithm (Theorem 10). Using arguments similar to Bassily et al. [2014], one can show that this bound is tight.


B.1 Converting dropout optimization to a differentially private algorithm

Notice that all of our stability guarantees in Section 4 are independent of the data distribution $\tau(\mathcal{D})$. This has direct implications for differential privacy. We show that using the propose-test-release (PTR) framework Dwork and Lei [2009], Smith and Thakurta [2013], the dropout heuristic provides a differentially private algorithm.

The $(\epsilon, \delta)$-local differential privacy guarantee from Section 4.1 and from the previous section can be interpreted as follows. For a given algorithm $\mathcal{A}$ and a level of accuracy $\sigma$ (or a bound on the extra randomness for local differential privacy), there exists a deterministic function $g_{\mathcal{A},\sigma} : \mathcal{D}^n \to \mathbb{R}$. If $g_{\mathcal{A},\sigma}(D)$ exceeds a threshold $\zeta$, then $\mathcal{A}(D)$ is $(\epsilon, \delta)$-locally differentially private at $D$. We drop the dependency of $g$ on $\mathcal{A}$ and $\sigma$ in the sequel.

As an example, consider the local differentially private variant of dropout from the previous section. Recall that to 
achieve local differential privacy, we introduced Gaussian perturbations to the model. If we require that the standard 
deviation of the noise added is bounded by o (corresponding to a given level of accuracy) and we expect (e, 4)- 
differential privacy, then it suffices for A (defined in Theorem 14) to be at least max { 0 (\/log T/T + ^)) , %-}, 
and for g(D) = A. 

For the settings in this paper, $g$ has bounded sensitivity (i.e., for any pair of neighboring data sets $D$ and $D'$, $|g(D) - g(D')| \leq \eta$). Notice that in our above example, with $g(D) = \Lambda$, the sensitivity of $g$ is $\eta = B^2/n$ (where $B$ upper bounds the $L_2$ norm of any feature vector $x$). For differential privacy, bounded sensitivity is essential. The following two-stage algorithm ensures $(\epsilon, \delta)$-differential privacy for dropout.

1. Noisy estimate of $g$: $\hat{g} \leftarrow g(D) + \mathrm{Lap}(\eta/\epsilon)$, where $\mathrm{Lap}(\lambda)$ is drawn from the Laplace distribution with scaling parameter $\lambda$.

2. Test for safety: If $\hat{g} > \zeta + (\eta \cdot \log(1/\delta)/\epsilon)$, then execute the dropout heuristic, otherwise fail.

One can show using tail bounds that if $g(D) > \zeta + (\eta \cdot \log(1/\delta)/\epsilon)$, then w.p. at least $1 - \delta$, the above algorithm will output the execution of the dropout heuristic exactly.
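The two stages above can be sketched as follows. This is only an illustration: `run_dropout` is a hypothetical callable standing in for the dropout heuristic, `None` represents failure, and the Laplace sample is generated as the difference of two exponential variables, which is one standard way to sample it.

```python
import math
import random

def laplace(scale, rng):
    """Sample Lap(scale) as scale times the difference of two Exp(1) variables."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def propose_test_release(g_value, eta, zeta, eps, delta, run_dropout, rng):
    """Stage 1: noisy estimate of g(D). Stage 2: release only if the test passes."""
    g_hat = g_value + laplace(eta / eps, rng)               # noisy estimate of g
    if g_hat > zeta + eta * math.log(1.0 / delta) / eps:    # test for safety
        return run_dropout()
    return None                                             # fail
```

A call such as `propose_test_release(g_D, eta, zeta, eps, delta, train, random.Random(0))` would either return the trained model or `None`, with `g_D` computed from the data by the caller.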


Optimality of error. Regularized private risk minimization is a well-studied area Chaudhuri and Monteleoni [2008], Dwork and Lei [2009], Chaudhuri et al. [2011], Jain et al. [2012], Kifer et al. [2012], Duchi et al. [2013], Song et al. [2013], Jain and Thakurta [2014], Bassily et al. [2014]. Here we provide yet another algorithm for private risk minimization. While the previous algorithms primarily used variants of $L_2$ regularization, we use the dropout regularization for the privacy guarantee. Conditioned on the PTR test passing, the utility of this algorithm follows from the learning bounds obtained for the local differentially private version of dropout, which depend on the noise level $\sigma$. While the bounds are tight in general Bassily et al. [2014] and there exist other algorithms achieving the same bounds Bassily et al. [2014], Jain and Thakurta [2014], since dropout has often been shown to outperform vanilla regularization schemes, it might be advantageous to use dropout learning even in the context of differential privacy.


B.2 Gaussian mechanism for differential privacy 


Gaussian mechanism. Let $f : \mathcal{D}^n \to \mathbb{R}^p$ be a vector-valued function. For a given data set $D$, the objective is to output an approximation to $f(D)$ while preserving differential privacy. Let $\eta = \max_{D, D' \text{ differing in one entry}} \|f(D) - f(D')\|_2$ be the sensitivity of the function $f$. The Gaussian mechanism refers to the following algorithm: output $f(D) + \mathcal{N}(0, \sigma^2\, \mathbb{I}_p)$, where the standard deviation $\sigma$ is proportional to $\eta$ and calibrated to $\epsilon$ and $\delta$. One can show that the above algorithm is $(\epsilon, \delta)$-differentially private. (See Nikolov et al. [2013, Lemma 4] for a proof.) Using the proof technique of Nikolov et al. [2013, Lemma 4], one can easily show that the theorem also holds for $(\epsilon, \delta)$-local differential privacy at any given data set $D \in \mathcal{D}^n$, with
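For concreteness, a Gaussian mechanism can be sketched as below. The noise calibration used here, $\sigma = \eta\sqrt{2\ln(1.25/\delta)}/\epsilon$ for $\epsilon < 1$, is the classical textbook choice and is an assumption on our part; it is not necessarily the constant used in the lemma cited above.

```python
import math
import random

def gaussian_mechanism(f_value, eta, eps, delta, rng):
    """Add isotropic Gaussian noise calibrated to the L2-sensitivity eta.

    Assumed calibration (classical, valid for eps < 1):
        sigma = eta * sqrt(2 * ln(1.25 / delta)) / eps
    """
    sigma = eta * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    noisy = [v + rng.gauss(0.0, sigma) for v in f_value]
    return noisy, sigma
```

Here `f_value` is the vector $f(D)$ computed by the caller, and `eta` must upper bound how much that vector can change between neighboring data sets.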


C Implications of stability on generalization performance 


In this section we focus on the following question: What implication does stability have on the generalization performance? Several existing results indeed show a formal connection between stability and generalization performance (see Shalev-Shwartz et al. [2010] for a survey). However, the notion of local differential privacy is significantly stronger than the other leave-one-out (LOO) stability notions used in the literature. This enables us to bound the generalization error under adversarial perturbations to a small number of points.


To this end, we first define the notion of distance of a dataset to “instability”: Let $\mathcal{A}$ be $(\epsilon, \delta)$-locally differentially private at dataset $D$; then $\Gamma$ is the distance to instability for $\mathcal{A}(D)$ if $\mathcal{A}(D')$ is not $(\epsilon, \delta)$-locally differentially private, where $D'$ is obtained by changing at most $\Gamma$ points in $D$. A similar notion of instability can be defined for model stability (i.e., the model does not change by more than $\epsilon_{\text{mod}}$ in the $L_2$-norm).


In the theorems below, we show that if fewer than $\Gamma$ points of $D$ are changed adversarially, then the generalization error of $\mathcal{A}$ does not increase significantly. That is, the algorithm can tolerate up to $\Gamma$ adversarially changed points.


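Theorem 16. Let $f : \mathcal{C} \to \mathbb{R}$ be an $L$-Lipschitz continuous function. For a given data set $D \in \mathcal{D}^n$, if an algorithm $\mathcal{A}$ (with the range space of $\mathcal{A}$ in $\mathcal{C}$) has $\Gamma$ distance to $\epsilon_{\text{mod}}$-model instability, then for any data set $D'$ differing in at most $m < \Gamma$ records from $D$, the following holds.

$$\mathbb{E}\left[|f(\mathcal{A}(D)) - f(\mathcal{A}(D'))|\right] \leq L m \epsilon_{\text{mod}}. \tag{28}$$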


The proof of Theorem 16 follows immediately from the triangle inequality.
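To spell out the triangle-inequality step, one can interpolate between $D$ and $D'$ through a chain of data sets that each differ in a single record. The sketch below assumes that every data set in the chain is still within the distance to instability, so that the model-stability bound $\epsilon_{\text{mod}}$ applies (in expectation) at each step; the names $D_0, \ldots, D_m$ are ours.

```latex
D = D_0,\ D_1,\ \ldots,\ D_m = D', \qquad D_{k-1} \text{ and } D_k \text{ differ in one record},
\\[4pt]
\mathbb{E}\big[|f(\mathcal{A}(D)) - f(\mathcal{A}(D'))|\big]
\le \sum_{k=1}^{m} \mathbb{E}\big[|f(\mathcal{A}(D_{k-1})) - f(\mathcal{A}(D_k))|\big]
\le L \sum_{k=1}^{m} \mathbb{E}\big[\|\mathcal{A}(D_{k-1}) - \mathcal{A}(D_k)\|_2\big]
\le L\, m\, \epsilon_{\text{mod}} .
```

The first inequality is the triangle inequality, the second uses the $L$-Lipschitz property of $f$, and the third applies model stability to each neighboring pair.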

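Theorem 17. Let $f : \mathcal{C} \to \mathbb{R}_+$ be a bounded function with $\max_{x \in \mathcal{C}} |f(x)| \leq B$. For a given data set $D \in \mathcal{D}^n$, if an algorithm $\mathcal{A}$ (with the range space of $\mathcal{A}$ in $\mathcal{C}$) has $\Gamma$ distance to $(\epsilon, \delta)$-local differential privacy, then for any data set $D'$ differing in at most $m < \Gamma$ records from $D$, the following holds.

$$\mathbb{E}\left[f(\mathcal{A}(D'))\right] \leq e^{m\epsilon}\, \mathbb{E}\left[f(\mathcal{A}(D))\right] + mB\delta. \tag{29}$$

Proof. By the definition of distance of a dataset to instability, and the composition property Dwork and Lei [2009] of differential privacy, we know that for any data set $D'$ differing in at most $m < \Gamma$ records from $D$, there exists a set $Q$ s.t. both over the randomness of $\mathcal{A}(D)$ and over the randomness of $\mathcal{A}(D')$, i) the measure $\mu(Q) \leq m\delta$, and ii) for all $s \in \mathrm{Range}(\mathcal{A}) \setminus Q$, $\left|\ln \frac{\mu_{\mathcal{A}(D)}(s)}{\mu_{\mathcal{A}(D')}(s)}\right| \leq m\epsilon$. Therefore,

$$\begin{aligned}
\mathbb{E}\left[f(\mathcal{A}(D'))\right] &= \int_{s \notin Q} f(s)\, \mu_{\mathcal{A}(D')}(s)\, ds + \int_{s \in Q} f(s)\, \mu_{\mathcal{A}(D')}(s)\, ds \\
&\leq e^{m\epsilon} \int_{s \notin Q} f(s)\, \mu_{\mathcal{A}(D)}(s)\, ds + B \int_{s \in Q} \mu_{\mathcal{A}(D')}(s)\, ds \\
&\leq e^{m\epsilon}\, \mathbb{E}\left[f(\mathcal{A}(D))\right] + mB\delta.
\end{aligned}$$

The last inequality is true because by assumption $f$ only maps to positive real numbers and the total measure on the set $Q$ is at most $m\delta$. This completes the proof. □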


The above theorems show that, for any $f$, the algorithm's outputs on $D$ and $D'$ incur nearly the same error, up to a multiplicative $e^{m\epsilon}$ factor and an additive $mB\delta$ term, where $\delta$ is $\mathrm{poly}(1/n)$. Hence, using $f$ as the excess population-risk function, the above theorem shows that the excess risk due to adversarial corruptions is bounded by $mB\delta$, where $m$ is the number of points changed, $B$ is the absolute bound on the function value and $\delta$ is $O(\mathrm{poly}(1/n))$.

We now give an explicit bound on the distance to instability for a simple linear loss problem. For ease of exposition, we assume the number of dimensions $p$ to be a constant, ignore polylog terms in $n$, and hide them in the $O(\cdot)$ notation.


Example. Consider a model vector $\theta^* \in \mathcal{C}$ and a distribution $\mathcal{N}(0, \Sigma^{-1})$, where $\Sigma$ is a diagonal covariance matrix with the diagonal being $(\lambda_1^2, \cdots, \lambda_p^2)$. Let $D = \{(x_1, y_1), \cdots, (x_n, y_n)\}$ be a data set of $n$ entries, where each $x_i \sim \mathcal{N}(0, \Sigma^{-1})$ and $y_i = \langle x_i, \theta^* \rangle$. For dropout regularization, the corresponding optimization problem is given in (12).

Let

$$\Lambda_\Gamma = \frac{1}{n} \min_{S \subseteq [n], |S| = \Gamma}\ \min_{j \in [p]} \sum_{i=1,\, i \notin S}^{n} x_i(j)^2.$$

Intuitively, $\Lambda_\Gamma$ refers to the strong convexity of the expected objective function in (12), when at most $\Gamma$ entries are adversarially modified in the data set $D$. By the tail bound for the Chi-squared distribution, one can show that $\Lambda_\Gamma = \Theta(1)$ with probability at least $1 - 1/\mathrm{poly}(n)$ as long as $\lambda_1, \cdots, \lambda_p$ are $\Theta(1)$ and $\Gamma = o(n)$. Therefore, for a given value of $\epsilon$ and $\delta$ and any data set $D'$ such that $|D \Delta D'| \leq 2\Gamma$, the standard deviation $\sigma$ of the noise needed to obtain $(\epsilon, \delta)$-local differential privacy can be bounded using $\Lambda_\Gamma$ in place of $\Lambda$. This implies that for this value of $\sigma$, (29) in Theorem 17 is true for all $m = o(n)$.

Similarly, for model stability one can show that (28) in Theorem 16 holds for all $m = o(n)$ in the above example.


D Missing details from experiment section 

D.1 Stability experiment for logistic regression

Here we provide the results of the stability experiment for the following data sets: i) comp.graphics vs. comp.windows in Figure 3(a) (with 1953 examples and 31822 features), and ii) rec.sport.baseball vs. sci.crypt in Figure 3(b) (with 1985 examples and 24977 features). In all the data sets above, we use half the data for training and the rest for testing. For error, we measure the fraction of misclassified examples.

Figure 3: Stability analysis for logistic regression (random sub-sampling)


D.2 Stability Experiment for DBNs 

Here we provide the stability result for the Leaves data set, which consists of 8000 images of leaves from 10 different classes. We use 7000 images for training and 1000 images for testing (by splitting randomly). For training a DBN on this data set, we use a network with a configuration of 784, 1024, 1024, and 10 units respectively. For error, we measure the number of misclassifications.

Note. Each DBN was trained for 150 epochs, and each experiment was repeated five times.

Figure 4: Stability analysis for DBN with Leaves data set (under random sub-sampling).